CP610 Data Analysis - Group 8 - Final Project
Project title: Breast Cancer Wisconsin (Original)
Student information
- Gia Phat Huynh (huyn8900@mylaurier.ca)
- Thai Son Truong (truo1520@mylaurier.ca)
- The Minh Nguyen (nguy6401@mylaurier.ca)
I. Introduction
Cancer remains one of the most formidable health challenges worldwide, affecting millions of lives annually. Among the various types of cancer, breast cancer is particularly prevalent, impacting millions of women globally. It is characterized by the uncontrolled growth of abnormal cells in the breast tissue, which can metastasize to other parts of the body if not detected and treated early. Early detection and accurate diagnosis are therefore paramount: they significantly improve survival rates and the effectiveness of treatment.
In the realm of medical diagnostics, machine learning algorithms have emerged as vital tools in assisting healthcare professionals with early detection and classification of diseases, including cancer. These algorithms analyze complex datasets to identify patterns and anomalies that may indicate the presence of cancer. The Breast Cancer Wisconsin (Original) dataset, which includes features from fine needle aspirate (FNA) samples of breast masses, is a valuable resource for developing and testing these diagnostic algorithms. By leveraging machine learning techniques, it is possible to enhance the accuracy and efficiency of breast cancer diagnosis, ultimately contributing to better patient care and outcomes. This report aims to explore the application of machine learning models to this dataset, with the goal of improving the prediction and classification of benign and malignant breast tumors.
II. Dataset Introduction
The Breast Cancer Wisconsin (Original) dataset is a widely utilized dataset in medical research and machine learning, particularly in the field of breast cancer diagnosis. Collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison, from 1989 to 1991, it comprises 699 instances of fine needle aspirate (FNA) samples from breast masses. Each instance is characterized by nine cytological features rated on a 1-10 scale, including clump thickness, uniformity of cell size and shape, marginal adhesion, and bare nuclei. The goal is to classify these samples into benign or malignant categories. This dataset has been instrumental in developing and testing various algorithms for accurate and early detection of breast cancer.
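As an aside, the same dataset can be fetched programmatically; a minimal sketch, assuming the ucimlrepo package is installed (this notebook instead loads a local copy of the .data file below):

from ucimlrepo import fetch_ucirepo

# Fetch Breast Cancer Wisconsin (Original); UCI repository id 15
breast_cancer = fetch_ucirepo(id=15)
X_fetched = breast_cancer.data.features  # the nine cytological attributes
y_fetched = breast_cancer.data.targets   # Class: 2 = benign, 4 = malignant
print(X_fetched.shape, y_fetched.shape)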
Additional Information
Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:
- Group 1: 367 instances (January 1989)
- Group 2: 70 instances (October 1989)
- Group 3: 31 instances (February 1990)
- Group 4: 17 instances (April 1990)
- Group 5: 48 instances (August 1990)
- Group 6: 49 instances (Updated January 1991)
- Group 7: 31 instances (June 1991)
- Group 8: 86 instances (November 1991)
Total: 699 points (as of the donated database on 15 July 1992)
III. Objectives
The primary objective of this project is to conduct a prediction analysis on the Breast Cancer Wisconsin (Original) dataset. In this notebook, we aim to:
- Showcase the steps involved in data preprocessing, model training, evaluation, and interpretation.
- Provide insights into feature importance and model performance metrics relevant to cancer diagnosis.
- Demonstrate the practical implementation of machine learning models like Random Forest Classifier, Logistic Regression, SVM, etc.
Our steps:
Initially, the dataset will be downloaded and meticulously cleaned to address missing values, duplicates, and anomalies. Following this, a stratified random split will divide the data into training and test sets in an 80/20 ratio, preserving the benign/malignant class distribution in both sets.
Next, four different machine learning models (Random Forest Classifier, Logistic Regression, SVM, KNN Classifier) will be applied, and 5-fold cross-validation will be utilized to determine the optimal hyperparameters for each model. These fine-tuned models will then be trained on the training set and evaluated on the test set, using accuracy as the primary metric for assessing the prediction of benign or malignant samples.
Finally, the analysis will conclude with a thorough examination of the process, supported by solid data evidence, to draw insights such as identifying key features that enhance classification accuracy, comparing model performances, and evaluating the impact of various data manipulation techniques. Visualizations will be employed throughout the project to communicate findings effectively.
IV. Data preprocessing
a) Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import FitFailedWarning
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from scipy.stats import randint
from sklearn.metrics import accuracy_score
b) Load dataset
column_names = ['Sample code number', 'Clump thickness', 'Uniformity of cell size', 'Uniformity of cell shape',
                'Marginal adhesion', 'Single epithelial cell size', 'Bare_nuclei', 'Bland chromatin', 'Normal nucleoli',
                'Mitoses', 'Class']
df = pd.read_csv('breast_cancer_wisconsin_original/breast-cancer-wisconsin.data', header=None, names=column_names)
display(df)
| | Sample code number | Clump thickness | Uniformity of cell size | Uniformity of cell shape | Marginal adhesion | Single epithelial cell size | Bare_nuclei | Bland chromatin | Normal nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |
| 698 | 897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 4 |
699 rows × 11 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           699 non-null    int64
 1   Clump thickness              699 non-null    int64
 2   Uniformity of cell size      699 non-null    int64
 3   Uniformity of cell shape     699 non-null    int64
 4   Marginal adhesion            699 non-null    int64
 5   Single epithelial cell size  699 non-null    int64
 6   Bare_nuclei                  699 non-null    object
 7   Bland chromatin              699 non-null    int64
 8   Normal nucleoli              699 non-null    int64
 9   Mitoses                      699 non-null    int64
 10  Class                        699 non-null    int64
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
df.drop("Sample code number", axis=1, inplace=True)  # Drop the ID column, which is irrelevant for prediction
df.describe()
| | Clump thickness | Uniformity of cell size | Uniformity of cell shape | Marginal adhesion | Single epithelial cell size | Bland chromatin | Normal nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 |
| mean | 4.417740 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.214300 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
| 25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 |
| 50% | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 |
| 75% | 6.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 1.000000 | 4.000000 |
| max | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 4.000000 |
# The dataset documentation mentions missing values in the 'Bare_nuclei' column. Let's figure out where they are
print(df['Bare_nuclei'].unique())
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
# Bare_nuclei has missing values, denoted by '?'
df['Bare_nuclei'].value_counts()
Bare_nuclei
1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: count, dtype: int64
# Query the rows that contain '?'
df.query("Bare_nuclei == '?'")
| | Clump thickness | Uniformity of cell size | Uniformity of cell shape | Marginal adhesion | Single epithelial cell size | Bare_nuclei | Bland chromatin | Normal nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 8 | 4 | 5 | 1 | 2 | ? | 7 | 3 | 1 | 4 |
| 40 | 6 | 6 | 6 | 9 | 6 | ? | 7 | 8 | 1 | 2 |
| 139 | 1 | 1 | 1 | 1 | 1 | ? | 2 | 1 | 1 | 2 |
| 145 | 1 | 1 | 3 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
| 158 | 1 | 1 | 2 | 1 | 3 | ? | 1 | 1 | 1 | 2 |
| 164 | 5 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
| 235 | 3 | 1 | 4 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
| 249 | 3 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
| 275 | 3 | 1 | 3 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
| 292 | 8 | 8 | 8 | 1 | 2 | ? | 6 | 10 | 1 | 4 |
| 294 | 1 | 1 | 1 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
| 297 | 5 | 4 | 3 | 1 | 2 | ? | 2 | 3 | 1 | 2 |
| 315 | 4 | 6 | 5 | 6 | 7 | ? | 4 | 9 | 1 | 2 |
| 321 | 3 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
| 411 | 1 | 1 | 1 | 1 | 1 | ? | 2 | 1 | 1 | 2 |
| 617 | 1 | 1 | 1 | 1 | 1 | ? | 1 | 1 | 1 | 2 |
# Replace '?' with NaN (assigning back avoids the deprecated inplace replace on a column)
df['Bare_nuclei'] = df['Bare_nuclei'].replace('?', pd.NA)
# Bar plot of missing values
missing_values_count = df.isnull().sum()
plt.figure(figsize=(10, 6))
missing_values_count.plot(kind='barh', color='skyblue')
plt.title('Missing Values in Each Column')
plt.xlabel('Count of Missing Values')
plt.show()
# Convert the 'Bare_nuclei' column to numeric, coercing errors to NaN
df['Bare_nuclei'] = pd.to_numeric(df['Bare_nuclei'], errors='coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Clump thickness              699 non-null    int64
 1   Uniformity of cell size      699 non-null    int64
 2   Uniformity of cell shape     699 non-null    int64
 3   Marginal adhesion            699 non-null    int64
 4   Single epithelial cell size  699 non-null    int64
 5   Bare_nuclei                  683 non-null    float64
 6   Bland chromatin              699 non-null    int64
 7   Normal nucleoli              699 non-null    int64
 8   Mitoses                      699 non-null    int64
 9   Class                        699 non-null    int64
dtypes: float64(1), int64(9)
memory usage: 54.7 KB
# Fill missing values with the mean of each column
df['Bare_nuclei'] = df['Bare_nuclei'].fillna(df['Bare_nuclei'].mean())
df['Bare_nuclei'].value_counts()
Bare_nuclei
1.000000     402
10.000000    132
2.000000      30
5.000000      30
3.000000      28
8.000000      21
4.000000      19
3.544656      16
9.000000       9
7.000000       8
6.000000       4
Name: count, dtype: int64
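For reference, the same mean imputation could be done with scikit-learn's SimpleImputer, which is convenient when the imputation step must live inside a modeling pipeline; a small sketch (not used in the rest of this notebook):

# Sketch: mean imputation via SimpleImputer, equivalent to the fillna above
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[['Bare_nuclei']] = imputer.fit_transform(df_imputed[['Bare_nuclei']])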
c) Exploratory data analysis
# Suppress specific warning
warnings.simplefilter(action='ignore', category=FutureWarning)
# Create a box plot for each variable in the dataset
plt.figure(figsize=(10, 8))
sns.boxplot(data=df, palette='Set2', orient='h')
plt.title('Box Plot for Each Variable')
plt.xlabel('Values')
plt.ylabel('Variables')
plt.show()
# Detect outliers using the 1.5 * IQR rule
for var in df.columns:
    # Calculate quartiles and IQR
    q1 = df[var].quantile(0.25)
    q3 = df[var].quantile(0.75)
    iqr = q3 - q1
    # Calculate lower and upper bounds
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Detect outliers
    outliers = df[(df[var] < lower_bound) | (df[var] > upper_bound)]
    # Display count of outliers
    print(f"Number of outliers for variable {var}: {len(outliers)}")
Number of outliers for variable Clump thickness: 0
Number of outliers for variable Uniformity of cell size: 0
Number of outliers for variable Uniformity of cell shape: 0
Number of outliers for variable Marginal adhesion: 60
Number of outliers for variable Single epithelial cell size: 54
Number of outliers for variable Bare_nuclei: 0
Number of outliers for variable Bland chromatin: 20
Number of outliers for variable Normal nucleoli: 77
Number of outliers for variable Mitoses: 120
Number of outliers for variable Class: 0
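We keep these observations as-is, since every feature is already bounded on the 1-10 scale. If capping were desired, a sketch using the same IQR fences might look like this (not applied in this notebook):

# Sketch only: cap values outside the IQR fences with pandas' clip
capped = df.copy()
for var in capped.columns.drop('Class'):
    q1, q3 = capped[var].quantile([0.25, 0.75])
    iqr = q3 - q1
    capped[var] = capped[var].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)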
data = df.copy()  # Work on a copy so the label mapping below leaves df untouched
data.head(10)
| | Clump thickness | Uniformity of cell size | Uniformity of cell shape | Marginal adhesion | Single epithelial cell size | Bare_nuclei | Bland chromatin | Normal nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 5 | 8 | 10 | 10 | 8 | 7 | 10.0 | 9 | 7 | 1 | 4 |
| 6 | 1 | 1 | 1 | 1 | 2 | 10.0 | 3 | 1 | 1 | 2 |
| 7 | 2 | 1 | 2 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 8 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 5 | 2 |
| 9 | 4 | 2 | 1 | 1 | 2 | 1.0 | 2 | 1 | 1 | 2 |
# Ensure all columns are of numeric type
data = data.apply(pd.to_numeric, errors='coerce')
data['Class'] = data['Class'].map({2: 'Benign', 4: 'Malignant'})
# Count the occurrences of each class
class_counts = data['Class'].value_counts().reset_index()
class_counts.columns = ['Class', 'Count']
fig_class_pie = px.pie(
    class_counts,
    names='Class',
    values='Count',
    title='Percentage of Benign and Malignant Cases',
    color_discrete_map={'Malignant': 'red', 'Benign': 'skyblue'}
)
fig_class_pie.update_layout(
    margin=dict(t=50, l=25, r=25, b=25)
)
fig_class_pie.show()
# Group the data by Clump thickness and Class and count the occurrences
count_data = data.groupby(['Clump thickness', 'Class']).size().reset_index(name='Count')
fig = px.bar(
    count_data,
    x='Clump thickness',
    y='Count',
    color='Class',
    title='Bivariate Analysis: Clump Thickness vs Class',
    labels={'Clump thickness': 'Clump Thickness', 'Count': 'Count', 'Class': 'Class'},
    color_discrete_map={'Malignant': 'red', 'Benign': 'skyblue'},
    barmode='stack',
    height=600)
fig.show()
Our verdict: the bar chart shows that increased clump thickness is associated with a higher risk of malignancy.
# Group the data by Uniformity of cell size and Class and count the occurrences
count_data = data.groupby(['Uniformity of cell size', 'Class']).size().reset_index(name='Count')
fig = px.bar(
    count_data,
    x='Uniformity of cell size',
    y='Count',
    color='Class',
    title='Bivariate Analysis: Uniformity of cell size vs Class',
    labels={'Uniformity of cell size': 'Uniformity of cell size', 'Count': 'Count', 'Class': 'Class'},
    color_discrete_map={'Malignant': 'red', 'Benign': 'skyblue'},
    barmode='stack',
    height=600)
fig.show()
# Group the data by Marginal adhesion and Class and count the occurrences
count_data = data.groupby(['Marginal adhesion', 'Class']).size().reset_index(name='Count')
fig = px.bar(
    count_data,
    x='Marginal adhesion',
    y='Count',
    color='Class',
    title='Bivariate Analysis: Marginal adhesion vs Class',
    labels={'Marginal adhesion': 'Marginal adhesion', 'Count': 'Count', 'Class': 'Class'},
    color_discrete_map={'Malignant': 'red', 'Benign': 'skyblue'},
    barmode='stack',
    height=600)
fig.show()
correlation_matrix = df.corr()
# Create the heatmap using Plotly
fig_corr = px.imshow(
    correlation_matrix,
    labels=dict(color="Correlation"),
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    color_continuous_scale='ice',
    height=600,
    title='Correlation Matrix Heatmap'
)
# Show the heatmap
fig_corr.show()
Our verdict:
The variables "Uniformity of cell size" and "Uniformity of cell shape" show the strongest positive correlations with "Class" (0.8179 and 0.8189, respectively), suggesting that larger and more irregularly shaped cells are more likely to be malignant.
"Clump thickness" also correlates strongly with "Class" (0.7160), indicating that increased clump thickness is associated with a higher likelihood of malignancy.
Additionally, variables such as "Bare_nuclei", "Bland chromatin", and "Normal nucleoli" exhibit moderate positive correlations with "Class", highlighting their influence on tumor classification.
In contrast, "Mitoses" has the lowest positive correlation with "Class" (0.4231), indicating a weaker association.
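These readings come straight from the Class column of the correlation matrix computed above; a one-liner to list them in order:

# Correlations of each feature with Class, strongest first
print(correlation_matrix['Class'].drop('Class').sort_values(ascending=False))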
d) Data features and data targets
#Data features and data targets
data_features = df.iloc[:,0:9]
display(data_features)
data_targets = df.iloc[:,9]
display(data_targets)
| | Clump thickness | Uniformity of cell size | Uniformity of cell shape | Marginal adhesion | Single epithelial cell size | Bare_nuclei | Bland chromatin | Normal nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 3 | 1 | 1 | 1 | 3 | 2.0 | 1 | 1 | 1 |
| 695 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 1 |
| 696 | 5 | 10 | 10 | 3 | 7 | 3.0 | 8 | 10 | 2 |
| 697 | 4 | 8 | 6 | 4 | 3 | 4.0 | 10 | 6 | 1 |
| 698 | 4 | 8 | 8 | 5 | 4 | 5.0 | 10 | 4 | 1 |
699 rows × 9 columns
0 2
1 2
2 2
3 2
4 2
..
694 2
695 2
696 4
697 4
698 4
Name: Class, Length: 699, dtype: int64
X = data_features
y = data_targets
y = y[X.index] #To ensure targets are aligned with features
e) Train-test split
# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
f) Data standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
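One caveat: the scaler above is fit on the full training set before cross-validation, so a small amount of information leaks between CV folds during hyperparameter tuning. A stricter alternative (a sketch, not used below) wraps scaling and the model in a scikit-learn Pipeline so the scaler is re-fit on each training fold:

# Sketch: fold-safe scaling with a Pipeline (step names 'scaler'/'svm' are arbitrary)
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
param_grid_pipe = {'svm__C': [0.1, 1, 10, 100], 'svm__kernel': ['linear', 'rbf']}
# GridSearchCV(pipe, param_grid_pipe, cv=5, scoring='accuracy') would then
# re-fit the scaler inside every fold, eliminating the leakage.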
V. Model training and evaluation
a) Random Forest Classifier
# Ignore UserWarning
warnings.filterwarnings("ignore", category=UserWarning)
# Ignore FitFailedWarning
warnings.filterwarnings("ignore", category=FitFailedWarning)
#Initial model
rf = RandomForestClassifier()
#Hyperparameter tuning using GridSearchCV with 5-fold cross-validation
param_grid_rf = {
    'n_estimators': [10, 50, 100, 200],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3 and causes fit failures
    'max_depth': [6, 8, 10, 12, 14],
    'criterion': ['gini', 'entropy']
}
gcv_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy')
#Hyperparameter tuning using RandomizedSearchCV with 5-fold cross-validation
param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],  # 'auto' removed for the same reason
    'max_depth': [4, 6, 8, 10, 12],
    'criterion': ['gini', 'entropy']
}
rcv_rf = RandomizedSearchCV(rf, param_distributions=param_dist_rf, n_iter=50, cv=5, scoring='accuracy', random_state=42)
#Fit model
gcv_rf.fit(X_train, y_train)
rcv_rf.fit(X_train, y_train)
#Best estimator
best_gcv_rf = gcv_rf.best_estimator_
best_rcv_rf = rcv_rf.best_estimator_
#Prediction
y_pred_gcv_rf = best_gcv_rf.predict(X_test)
y_pred_rcv_rf = best_rcv_rf.predict(X_test)
#Evaluate model
accuracy_gcv_rf = accuracy_score(y_test, y_pred_gcv_rf)
accuracy_rcv_rf = accuracy_score(y_test, y_pred_rcv_rf)
#Print result
print(f"Best Random Forest Classifier parameters (GridSeachCV): {gcv_rf.best_params_}")
print("Random Forest Classifier Test Accuary (GridSeachCV): %.5f" %accuracy_gcv_rf)
print(f"Best Random Forest Classifier parameters (RandomizedSeachCV): {rcv_rf.best_params_}")
print("Random Forest Classifier Test Accuary (RandomizedSeachCV): %.5f" %accuracy_rcv_rf)
Best Random Forest Classifier parameters (GridSearchCV): {'criterion': 'gini', 'max_depth': 6, 'max_features': 'log2', 'n_estimators': 200}
Random Forest Classifier Test Accuracy (GridSearchCV): 0.95714
Best Random Forest Classifier parameters (RandomizedSearchCV): {'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'entropy'}
Random Forest Classifier Test Accuracy (RandomizedSearchCV): 0.96429
b) Logistic Regression
# Ignore UserWarning
warnings.filterwarnings("ignore", category=UserWarning)
# Ignore FitFailedWarning
warnings.filterwarnings("ignore", category=FitFailedWarning)
#Initial model
log_reg = LogisticRegression()
# Hyperparameter tuning using GridSearchCV with 5-fold cross-validation
param_grid_log_reg = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga']
}
cv_log_reg = GridSearchCV(log_reg, param_grid_log_reg, cv=5, scoring='accuracy')
#Fit model
cv_log_reg.fit(X_train, y_train)
#Best estimator
best_log_reg = cv_log_reg.best_estimator_
#Prediction
y_pred_log_reg = best_log_reg.predict(X_test)
#Evaluate model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
#Print result
print(f"Best Logistic Regression parameters: {cv_log_reg.best_params_}")
print("Logistic Regression Test Accuracy: %.5f" %accuracy_log_reg)
Best Logistic Regression parameters: {'C': 1, 'solver': 'liblinear'}
Logistic Regression Test Accuracy: 0.95000
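Because the features were standardized, the magnitudes of the logistic regression coefficients give a rough feature ranking; a quick inspection sketch:

# Sketch: rank features by their standardized logistic-regression coefficients
coef = pd.Series(best_log_reg.coef_[0], index=data_features.columns)
print(coef.sort_values(ascending=False))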
c) Support Vector Machines (SVM)
# Ignore UserWarning
warnings.filterwarnings("ignore", category=UserWarning)
# Ignore FitFailedWarning
warnings.filterwarnings("ignore", category=FitFailedWarning)
#Initial model
svm = SVC()
# Hyperparameter tuning using GridSearchCV with 5-fold cross-validation
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf']
}
cv_svm = GridSearchCV(svm, param_grid_svm, cv=5, scoring='accuracy')
#Fit model
cv_svm.fit(X_train, y_train)
#Best estimator
best_svm = cv_svm.best_estimator_
#Prediction
y_pred_svm = best_svm.predict(X_test)
#Evaluate result
accuracy_svm = accuracy_score(y_test, y_pred_svm)
#Print result
print(f"Best SVM parameters: {cv_svm.best_params_}")
print("SVM Test Accuracy: %.5f" %accuracy_svm)
Best SVM parameters: {'C': 10, 'gamma': 1, 'kernel': 'linear'}
SVM Test Accuracy: 0.95714
d) K-nearest Neighbors Classifier
# Ignore UserWarning
warnings.filterwarnings("ignore", category=UserWarning)
# Ignore FitFailedWarning
warnings.filterwarnings("ignore", category=FitFailedWarning)
#Initial model
knn_classifier = KNeighborsClassifier()
# Hyperparameter tuning using GridSearchCV with 5-fold cross-validation
param_grid_knn_classifier = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}
cv_knn_classifier = GridSearchCV(knn_classifier, param_grid_knn_classifier, cv=5, scoring='accuracy')
#Fit model
cv_knn_classifier.fit(X_train, y_train)
#Best estimator
best_knn_classifier = cv_knn_classifier.best_estimator_
#Prediction
y_pred_knn_classifier = best_knn_classifier.predict(X_test)
#Evaluate model
accuracy_knn_classifier = accuracy_score(y_test, y_pred_knn_classifier)
#Print result
print(f"Best KNN Classifier parameters: {cv_knn_classifier.best_params_}")
print("KNN Classifier Test Accuracy: %5.f" %accuracy_knn_classifier)
Best KNN Classifier parameters: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
KNN Classifier Test Accuracy: 1
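Accuracy alone can mask the error type that matters most clinically (a malignant case predicted benign). A quick check of per-class performance for the KNN predictions, as a sketch:

# Sketch: confusion matrix and per-class precision/recall for KNN
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred_knn_classifier))
print(classification_report(y_test, y_pred_knn_classifier,
                            target_names=['Benign (2)', 'Malignant (4)']))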
VI. Feature Importance Visualization
# Feature importance from Random Forest model
feature_importance = best_rcv_rf.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Visualization
fig_feature_impt = px.bar(importance_df, x='Importance', y='Feature', orientation='h', title='Feature Importance - Random Forest')
fig_feature_impt.update_layout(yaxis={'categoryorder':'total ascending'})
fig_feature_impt.show()
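Impurity-based importances from random forests can be biased toward features with many distinct values; permutation importance on the held-out test set is a common cross-check. A sketch, reusing the fitted best_rcv_rf from above:

# Sketch: permutation importance as a cross-check on the test set
from sklearn.inspection import permutation_importance

perm = permutation_importance(best_rcv_rf, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_series = pd.Series(perm.importances_mean, index=data_features.columns)
print(perm_series.sort_values(ascending=False))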
VII. Conclusion
--TBA--
VIII. References
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
- O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
- William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
- O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
- K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).